healthcare data
Adaptive Conformal Prediction via Bayesian Uncertainty Weighting for Hierarchical Healthcare Data
Shahbazi, Marzieh Amiri, Baheri, Ali, Azadeh-Fard, Nasibeh
Clinical decision-making demands uncertainty quantification that provides both distribution-free coverage guarantees and risk-adaptive precision, requirements that existing methods fail to jointly satisfy. We present a hybrid Bayesian-conformal framework that addresses this fundamental limitation in healthcare predictions. Our approach integrates Bayesian hierarchical random forests with group-aware conformal calibration, using posterior uncertainties to weight conformity scores while maintaining rigorous coverage validity. Evaluated on 61,538 admissions across 3,793 U.S. hospitals and 4 regions, our method achieves target coverage (94.3% vs 95% target) with adaptive precision: 21% narrower intervals for low-uncertainty cases while appropriately widening for high-risk predictions. Critically, we demonstrate that well-calibrated Bayesian uncertainties alone severely under-cover (14.1%), highlighting the necessity of our hybrid approach. This framework enables risk-stratified clinical protocols, efficient resource planning for high-confidence predictions, and conservative allocation with enhanced oversight for uncertain cases, providing uncertainty-aware decision support across diverse healthcare settings.
- North America > United States > New York > Monroe County > Rochester (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Information Technology > Data Science (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)
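The idea of weighting conformity scores by a model's own uncertainty, as described in the hybrid Bayesian-conformal entry above, can be illustrated with standard normalized split conformal prediction. This is a minimal sketch and not the authors' method: it omits the hierarchical random forest and the group-aware calibration, and it assumes that point predictions `mu` and positive uncertainty estimates `sigma` are already available from some model. All function names are hypothetical.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-based rank into sorted scores
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]

def calibrate(y_cal, mu_cal, sigma_cal, alpha=0.05):
    """Scores are absolute residuals divided by the (assumed) uncertainty."""
    scores = [abs(y - m) / s for y, m, s in zip(y_cal, mu_cal, sigma_cal)]
    return conformal_quantile(scores, alpha)

def predict_interval(mu, sigma, q):
    """Interval width scales with sigma: narrow when confident, wide when not."""
    return (mu - q * sigma, mu + q * sigma)

# Toy calibration set (made-up numbers, for illustration only).
q = calibrate([1, 2, 3, 4], [0, 0, 0, 0], [1, 1, 1, 2], alpha=0.2)
low_uncertainty = predict_interval(10.0, 0.5, q)
high_uncertainty = predict_interval(10.0, 2.0, q)
```

Because the quantile is taken over normalized scores, a single calibrated value `q` yields intervals of different widths per case, which is the adaptive behavior the abstract describes. The coverage guarantee here is the usual marginal one under exchangeability.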
MedEqualizer: A Framework Investigating Bias in Synthetic Medical Data and Mitigation via Augmentation
Salarian, Sama, Zhang, Yue, Padhee, Swati, Parthasarathy, Srinivasan
Synthetic healthcare data generation presents a viable approach to enhance data accessibility and support research by overcoming limitations associated with real-world medical datasets. However, ensuring fairness across protected attributes in synthetic data is critical to avoid biased or misleading results in clinical research and decision-making. In this study, we assess the fairness of synthetic data generated by multiple generative adversarial network (GAN)-based models using the MIMIC-III dataset, with a focus on representativeness across protected demographic attributes. We measure subgroup representation using the logarithmic disparity metric and observe significant imbalances, with many subgroups either underrepresented or overrepresented in the synthetic data, compared to the real data. To mitigate these disparities, we introduce MedEqualizer, a model-agnostic augmentation framework that enriches the underrepresented subgroups prior to synthetic data generation. Our results show that MedEqualizer significantly improves demographic balance in the resulting synthetic datasets, offering a viable path towards more equitable and representative healthcare data synthesis.
- North America > United States > District of Columbia > Washington (0.05)
- North America > United States > Ohio (0.05)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- Asia > Middle East > Israel (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
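To make the subgroup representation check in the MedEqualizer entry above concrete, here is a small sketch. The abstract does not give the formula for the logarithmic disparity metric, so this code assumes a simple definition: the natural log of the ratio between a subgroup's share in the synthetic data and its share in the real data. Under that assumption, negative values mean underrepresentation and positive values mean overrepresentation. Function and field names are hypothetical.

```python
import math
from collections import Counter

def subgroup_shares(records, attrs):
    """Share of records falling in each combination of protected attributes."""
    counts = Counter(tuple(r[a] for a in attrs) for r in records)
    total = len(records)
    return {group: c / total for group, c in counts.items()}

def log_disparity(real, synth, attrs):
    """Assumed definition: log(synthetic share / real share) per subgroup."""
    real_p = subgroup_shares(real, attrs)
    synth_p = subgroup_shares(synth, attrs)
    result = {}
    for group, p_real in real_p.items():
        p_syn = synth_p.get(group, 0.0)
        # A subgroup missing from the synthetic data is maximally underrepresented.
        result[group] = math.log(p_syn / p_real) if p_syn > 0 else -math.inf
    return result

# Made-up records, for illustration only.
real = [{"sex": "F"}, {"sex": "F"}, {"sex": "M"}, {"sex": "M"}]
synth = [{"sex": "F"}, {"sex": "F"}, {"sex": "F"}, {"sex": "M"}]
disparities = log_disparity(real, synth, ["sex"])
```

An augmentation step of the kind the abstract describes would then target the subgroups with the most negative values before the generative model is trained.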
Blockchain-Enabled Explainable AI for Trusted Healthcare Systems
This paper introduces a Blockchain-Integrated Explainable AI Framework (BXHF) for healthcare systems to tackle two essential challenges confronting health information networks: safe data exchange and comprehensible AI-driven clinical decision-making. Our architecture incorporates blockchain, ensuring patient records are immutable, auditable, and tamper-proof, alongside Explainable AI (XAI) methodologies that yield transparent and clinically relevant model predictions. By incorporating security assurances and interpretability requirements into a unified optimization pipeline, BXHF ensures both data-level trust (through verified and encrypted record sharing) and decision-level trust (through auditable and clinically aligned explanations). Its hybrid edge-cloud architecture allows for federated computation across different institutions, enabling collaborative analytics while protecting patient privacy. We demonstrate the framework's applicability through use cases such as cross-border clinical research networks, uncommon illness detection, and high-risk intervention decision support. By ensuring transparency, auditability, and regulatory compliance, BXHF improves the credibility, uptake, and effectiveness of AI in healthcare, laying the groundwork for safer and more reliable clinical decision-making.
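The tamper-evidence property that the abstract attributes to blockchain can be illustrated with a plain hash chain. This is a sketch of the general technique only, assuming SHA-256 and JSON serialization; it is not the BXHF architecture and has no consensus, encryption, or networking. All names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash preceding the first block

def block_hash(index, record, prev_hash):
    """Hash a block's contents together with the previous block's hash."""
    payload = json.dumps(
        {"index": index, "record": record, "prev_hash": prev_hash},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def append_block(chain, record):
    """Append a record, linking it to the current tail of the chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    index = len(chain)
    chain.append({
        "index": index,
        "record": record,
        "prev_hash": prev_hash,
        "hash": block_hash(index, record, prev_hash),
    })

def verify_chain(chain):
    """Recompute every hash; any edited record or broken link fails the check."""
    prev_hash = GENESIS
    for i, block in enumerate(chain):
        if block["index"] != i or block["prev_hash"] != prev_hash:
            return False
        if block["hash"] != block_hash(i, block["record"], prev_hash):
            return False
        prev_hash = block["hash"]
    return True

# Made-up records, for illustration only.
ledger = []
append_block(ledger, {"patient": "p-001", "event": "admission"})
append_block(ledger, {"patient": "p-001", "event": "discharge"})
```

Because each hash covers the previous one, changing an earlier record invalidates every later block, which is what makes after-the-fact edits detectable in an audit.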
An Analytical Approach to Privacy and Performance Trade-Offs in Healthcare Data Sharing
Wei, Yusi, Benson, Hande Y., Capan, Muge
The secondary use of healthcare data is vital for research and clinical innovation, but it raises concerns about patient privacy. This study investigates how to balance privacy preservation and data utility in healthcare data sharing, considering the perspectives of both data providers and data users. Using a dataset of adult patients hospitalized between 2013 and 2015, we predict whether sepsis was present at admission or developed during the hospital stay. We identify sub-populations, such as older adults, frequently hospitalized patients, and racial minorities, that are especially vulnerable to privacy attacks due to their unique combinations of demographic and healthcare utilization attributes. These groups are also critical for machine learning (ML) model performance. We evaluate three anonymization methods ($k$-anonymity, the technique by Zheng et al., and the MO-OBAM model) based on their ability to reduce re-identification risk while maintaining ML utility. Results show that $k$-anonymity offers limited protection. The methods of Zheng et al. and MO-OBAM provide stronger privacy safeguards, with MO-OBAM yielding the best utility outcomes: only a 2% change in precision and recall compared to the original dataset. This work provides actionable insights for healthcare organizations on how to share data responsibly. It highlights the need for anonymization methods that protect vulnerable populations without sacrificing the performance of data-driven models.
- North America > United States > Massachusetts > Hampshire County > Amherst (0.14)
- North America > United States > Alaska (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Overview (1.00)
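The re-identification risk discussed in the privacy trade-off entry above comes from rare combinations of quasi-identifiers. As a minimal sketch of the standard $k$-anonymity check (not of the Zheng et al. technique or MO-OBAM, which the abstract does not detail), the code below finds the smallest group of records sharing the same quasi-identifier values and lists the records in groups smaller than a chosen $k$. Field names and values are hypothetical.

```python
from collections import Counter

def _key(record, quasi_identifiers):
    """Tuple of a record's quasi-identifier values."""
    return tuple(record[q] for q in quasi_identifiers)

def k_anonymity(records, quasi_identifiers):
    """Size of the smallest equivalence class; the dataset is k-anonymous for this k."""
    classes = Counter(_key(r, quasi_identifiers) for r in records)
    return min(classes.values()) if classes else 0

def at_risk_records(records, quasi_identifiers, k):
    """Records whose quasi-identifier combination is shared by fewer than k records."""
    classes = Counter(_key(r, quasi_identifiers) for r in records)
    return [r for r in records if classes[_key(r, quasi_identifiers)] < k]

# Made-up records, for illustration only.
patients = [
    {"age_band": "30-39", "admissions": "1"},
    {"age_band": "30-39", "admissions": "1"},
    {"age_band": "30-39", "admissions": "1"},
    {"age_band": "80+", "admissions": "5+"},
]
qi = ["age_band", "admissions"]
```

In this toy data the single older, frequently hospitalized patient forms a class of size one, mirroring the abstract's point that such sub-populations are the most exposed.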
Processing of synthetic data in AI development for healthcare and the definition of personal data in EU law
Vallevik, Vibeke Binz, Befring, Anne Kjersti C., Elvatun, Severin, Nygaard, Jan Franz
Artificial intelligence (AI) has the potential to transform healthcare, but it requires access to health data. Synthetic data, generated through machine learning models trained on real data, offers a way to share data while preserving privacy. However, uncertainties in the practical application of the General Data Protection Regulation (GDPR) create an administrative burden, limiting the benefits of synthetic data. Through a systematic analysis of relevant legal sources and an empirical study, this article explores whether synthetic data should be classified as personal data under the GDPR. The study investigates the residual identification risk through generating synthetic data and simulating inference attacks, challenging common perceptions of technical identification risk. The findings suggest synthetic data is likely anonymous, depending on certain factors, but highlight uncertainties about what constitutes reasonably likely risk. To promote innovation, the study calls for clearer regulations to balance privacy protection with the advancement of AI in healthcare.
- Europe > Germany (0.45)
- Europe > Norway > Eastern Norway > Oslo (0.04)
- Europe > Netherlands > Overijssel (0.04)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Government > Regional Government > Europe Government (1.00)
- Health & Medicine > Therapeutic Area > Oncology (0.93)
Contextual Phenotyping of Pediatric Sepsis Cohort Using Large Language Models
Nagori, Aditya, Gautam, Ayush, Wiens, Matthew O., Nguyen, Vuong, Mugisha, Nathan Kenya, Kabakyenga, Jerome, Kissoon, Niranjan, Ansermino, John Mark, Kamaleswaran, Rishikesan
Clustering patient subgroups is essential for personalized care and efficient resource use. Traditional clustering methods struggle with high-dimensional, heterogeneous healthcare data and lack contextual understanding. This study evaluates Large Language Model (LLM) based clustering against classical methods using a pediatric sepsis dataset from a low-income country (LIC), containing 2,686 records with 28 numerical and 119 categorical variables. Patient records were serialized into text with and without a clustering objective. Embeddings were generated using quantized LLAMA 3.1 8B, DeepSeek-R1-Distill-Llama-8B with low-rank adaptation (LoRA), and Stella-En-400M-V5 models. K-means clustering was applied to these embeddings. Classical comparisons included K-Medoids clustering on UMAP and FAMD-reduced mixed data. Silhouette scores and statistical tests evaluated cluster quality and distinctiveness. Stella-En-400M-V5 achieved the highest Silhouette Score (0.86). LLAMA 3.1 8B with the clustering objective performed better with a higher number of clusters, identifying subgroups with distinct nutritional, clinical, and socioeconomic profiles. LLM-based methods outperformed classical techniques by capturing richer context and prioritizing key features. These results highlight the potential of LLMs for contextual phenotyping and informed decision-making in resource-limited settings.
- North America > Canada > British Columbia > Vancouver (0.05)
- Africa > Uganda > Western Region > Mbarara District (0.05)
- North America > United States > North Carolina > Durham County > Durham (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
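The pediatric sepsis entry above compares clusterings by Silhouette Score. As a minimal sketch of how that score is computed, the code below implements the standard mean silhouette coefficient with Euclidean distance on already-embedded points. The embedding step and K-means itself are outside its scope, the distance choice is an assumption (the abstract does not state one), and the function name is hypothetical.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

# Two well-separated 1-D groups (made-up values, for illustration only).
toy_points = [(0.0,), (1.0,), (10.0,), (11.0,)]
toy_labels = [0, 0, 1, 1]
toy_score = silhouette_score(toy_points, toy_labels)
```

Scores near 1 indicate compact, well-separated clusters, scores near 0 indicate overlapping clusters, and negative scores suggest points assigned to the wrong cluster.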
AIhub monthly digest: May 2025 – materials design, object state classification, and real-time monitoring for healthcare data
Welcome to our monthly digest, where you can catch up with any AIhub stories you may have missed, peruse the latest news, recap recent events, and more. This month, we learn about drug and material design using generative models and Bayesian optimization, find out about a system for real-time monitoring for healthcare data, and explore domain-specific distribution shifts in volunteer-collected biodiversity datasets. Ananya Joshi recently completed her PhD, where she developed a system that experts have used for the past two years to identify respiratory outbreaks (like COVID-19) in large-scale healthcare streams across the United States. In this interview, she tells us more about this project, how healthcare applications inspire basic AI research, and her future plans. Onur Boyar is a PhD student at Nagoya University, working on generative models and Bayesian methods for materials and drug design.
- Information Technology > Security & Privacy (0.61)
- Health & Medicine > Consumer Health (0.61)
Interview with Ananya Joshi: Real-time monitoring for healthcare data
In this interview series, we're meeting some of the AAAI/SIGAI Doctoral Consortium participants to find out more about their research. Ananya Joshi recently completed her PhD, where she developed a system that experts have used for the past two years to identify respiratory outbreaks (like COVID-19) in large-scale healthcare streams across the United States using her novel algorithms for ranking real-time events from large-scale time series data. In this interview, she tells us more about this project, how healthcare applications inspire basic AI research, and her future plans. When I started my PhD during the COVID-19 pandemic, there was an explosion in continuously-updated human health data. Still, it was difficult for people to figure out which data was important so that they could make decisions like increasing the number of hospital beds at the start of an outbreak or patching a serious data problem that would impact disease forecasting.
- North America > United States > Texas > Loving County (0.05)
- North America > United States > New York (0.05)
Leveraging Generative AI Through Prompt Engineering and Rigorous Validation to Create Comprehensive Synthetic Datasets for AI Training in Healthcare
Access to high-quality medical data is often restricted due to privacy concerns, posing significant challenges for training artificial intelligence (AI) algorithms within Electronic Health Record (EHR) applications. In this study, prompt engineering with the GPT-4 API was employed to generate high-quality synthetic datasets aimed at overcoming this limitation. The generated data encompassed a comprehensive array of patient admission information, including healthcare provider details, hospital departments, wards, bed assignments, patient demographics, emergency contacts, vital signs, immunizations, allergies, medical histories, appointments, hospital visits, laboratory tests, diagnoses, treatment plans, medications, clinical notes, visit logs, discharge summaries, and referrals. To ensure data quality and integrity, advanced validation techniques were implemented utilizing models such as BERT's Next Sentence Prediction for sentence coherence, GPT-2 for overall plausibility, RoBERTa for logical consistency, and autoencoders for anomaly detection, and a diversity analysis was conducted. Synthetic data that met all validation criteria were integrated into a comprehensive PostgreSQL database, serving as the data management system for the EHR application. This approach demonstrates that leveraging generative AI models with rigorous validation can effectively produce high-quality synthetic medical data, facilitating the training of AI algorithms while addressing privacy concerns associated with real patient data.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Missouri (0.04)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine > Therapeutic Area (1.00)
- Health & Medicine > Health Care Technology > Medical Record (1.00)
Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML
Salari, Mohammad Amir, Rahmani, Bahareh
The application of AI in healthcare allows for the identification of complex patterns in patient data, improving diagnostic accuracy, treatment personalization, and operational efficiency [1]. Healthcare providers are increasingly leveraging predictive analytics to foresee health outcomes, enabling earlier interventions and more targeted care [2][26]. For instance, AI models have proven effective in identifying high-risk patients and optimizing preventive care strategies [3]. Diabetes, a major global health challenge, requires early detection and preventive care. Predictive models built using accessible tools like BigQuery ML can help healthcare professionals identify at-risk individuals efficiently. Cloud computing serves as a critical tool for AI and ML in healthcare, addressing many of the technical and infrastructural challenges associated with large-scale data analysis. With scalable infrastructure, cloud platforms allow healthcare providers to process and store vast amounts of data, facilitating AI-driven insights without the need for extensive on-site resources [4].
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.71)
- Information Technology > Data Science > Data Mining > Big Data (0.68)